Add startupProbe result handling to kuberuntime #84279
/sig node
d51c86c to ddd5eec
/this-is-unbearable
Thanks! From the way I read the docs, the KEP and the changes in this PR, I think the comments in the tests should be updated. Isn't the correct behavior to restart the container when the startupProbe fails (not depending on the result of the liveness probe)?
kubernetes/test/e2e_node/startup_probe_test.go Lines 88 to 93 in ea4570a
More detailed comment here: #84178 (comment)
pkg/kubelet/kubelet.go
Outdated
		klog.V(1).Infof("SyncLoop (container unhealthy - liveness): %q", format.Pod(pod))
		handler.HandlePodSyncs([]*v1.Pod{pod})
	}
case updateStartup := <-kl.startupManager.Updates():
I think this will "race" against this update mechanism and ignore Success results (not a memory race, but in the sense that each update on the channel reaches only one of the two receivers):
kubernetes/pkg/kubelet/prober/prober_manager.go
Lines 305 to 310 in 323f99e
func (m *manager) updateStartup() {
	update := <-m.startupManager.Updates()
	started := update.Result == results.Success
	m.statusManager.SetContainerStartup(update.PodUID, update.ContainerID, started)
}
Not sure if that is intended or not..
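To illustrate the concern, here is a minimal sketch of two goroutines receiving from the same Go channel. The `update` type and the `splitUpdates` helper are hypothetical and are not kubelet code; the point is only that each value sent on a channel is delivered to exactly one receiver, so a second listener on the same `Updates()` channel would miss whatever the first one consumed.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical stand-in for a probe result update.
type update struct {
	ContainerID string
	Success     bool
}

// drain counts how many updates a single receiver sees on ch.
func drain(ch <-chan update, wg *sync.WaitGroup, seen *int) {
	defer wg.Done()
	for range ch {
		*seen++
	}
}

// splitUpdates sends n updates on one channel that is read by two
// competing receivers and returns how many updates each one got.
func splitUpdates(n int) (a, b int) {
	ch := make(chan update)
	var wg sync.WaitGroup
	wg.Add(2)
	go drain(ch, &wg, &a)
	go drain(ch, &wg, &b)
	for i := 0; i < n; i++ {
		ch <- update{ContainerID: fmt.Sprintf("c%d", i), Success: i%2 == 0}
	}
	close(ch)
	wg.Wait()
	return a, b
}

func main() {
	a, b := splitUpdates(10)
	// The split varies between runs, but no update is seen twice,
	// so the two counts always add up to the number sent.
	fmt.Printf("receiver A saw %d, receiver B saw %d, total %d\n", a, b, a+b)
}
```

Whether this applies to the PR depends on whether both listeners really read from the same manager instance, which is what the replies below discuss.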
But the channel isn't shared... one listens from startupManager.Updates and the other from livenessManager.Updates. (maybe I don't get the point)
Hmm. It looks like they indeed don't share the channel, since two startupManagers are created (in this PR).
Would the startupManager created in pkg/kubelet/kubelet.go ever run?
I hope so... but I have never really worked on that code...
ddd5eec to 440c44e
Normally in that test we should have |
The command is
So, should a pod like this restart or just start normally:
cmd := []string{"sleep", "inf"}
livenessProbe := &v1.Probe{
	Handler: v1.Handler{
		Exec: &v1.ExecAction{
			Command: []string{"true"},
		},
	},
}
readinessProbe := &v1.Probe{
	Handler: v1.Handler{
		Exec: &v1.ExecAction{
			Command: []string{"true"},
		},
	},
}
startupProbe := &v1.Probe{
	Handler: v1.Handler{
		Exec: &v1.ExecAction{
			Command: []string{"false"},
		},
	},
}
restart and crash loop backoff
Actually the difference between the two tests is the |
/uncc
Deferring to @odinuge, who is already doing a great job with the review :)
/test pull-kubernetes-e2e-kind
5e51f45 to 89d4705
Squashed and ready for review! Thanks a lot for all your help @odinuge!
Thanks for the work @matthyx! Hopefully we can get this into beta in v1.17!
/lgtm
	Failure Result = 1

	// Unknown is encoded as -1 (type Result)
	Unknown Result = -1
I'll defer the naming of this to the approvers (e.g. Pending)
	Failure Result = 1

	// Unknown is encoded as -1 (type Result)
	Unknown Result = -1
)

func (r Result) String() string {
Should we add the new Result value here (and in ToPrometheusType) as well? I think the current behavior is OK, reporting -1/"UNKNOWN", but it may be nice to be explicit. I'll defer that to the approvers 😄
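As a rough sketch of what "being explicit" could look like, the following self-contained example gives `Unknown` its own case in both methods. The method names `String` and `ToPrometheusType` and the -1/"UNKNOWN" values come from this thread; the other return strings and the exact method bodies are assumptions made for illustration, not the code in the PR.

```go
package main

import "fmt"

// Result is a three-valued probe result, using the encodings from the diff.
type Result int

const (
	Unknown Result = -1
	Success Result = 0
	Failure Result = 1
)

// String names each value; Unknown gets its own case instead of
// relying on the default branch.
func (r Result) String() string {
	switch r {
	case Success:
		return "Success"
	case Failure:
		return "Failure"
	case Unknown:
		return "UNKNOWN"
	default:
		// Out-of-range values are treated like Unknown in this sketch.
		return "UNKNOWN"
	}
}

// ToPrometheusType maps each value to the number exported as a metric.
func (r Result) ToPrometheusType() float64 {
	switch r {
	case Success:
		return 0
	case Failure:
		return 1
	case Unknown:
		return -1
	default:
		return -1
	}
}

func main() {
	for _, r := range []Result{Unknown, Success, Failure} {
		fmt.Println(r, r.ToPrometheusType())
	}
}
```

The observable behavior is the same as falling through to a default branch; the explicit case only documents the intent.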
/assign @tallclair
@Random-Liu, @dchen1107, @derekwaynecarr, @tallclair, @vishh, @yujuhong this PR is really needed for 1.17, can you please make sure we don't miss code-freeze?
@@ -732,6 +733,7 @@ func TestComputePodActions(t *testing.T) {
 		mutatePodFn    func(*v1.Pod)
 		mutateStatusFn func(*kubecontainer.PodStatus)
 		actions        podActions
+		resetStatusFn  func(*kubecontainer.PodStatus)
Can we call createTestRuntimeManager inside the loop, so that we don't need to reset the status?
Makes sense... maybe it could go into another PR?
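The suggestion amounts to building a fresh fixture for every table entry so that state mutated by one case cannot leak into the next. A minimal sketch of that pattern follows; `podStatus`, `newPodStatus`, `testCase`, `sampleCases` and `runCases` are hypothetical stand-ins for this example and are not the helpers used in TestComputePodActions.

```go
package main

import "fmt"

// podStatus is a hypothetical stand-in for the status a test case mutates.
type podStatus struct {
	restarts int
}

// newPodStatus is a hypothetical fixture constructor.
func newPodStatus() *podStatus { return &podStatus{} }

// testCase describes one table entry: an optional mutation and the
// state expected afterwards.
type testCase struct {
	name         string
	mutate       func(*podStatus)
	wantRestarts int
}

// sampleCases returns two entries; the second would fail if it
// inherited the mutation made by the first.
func sampleCases() []testCase {
	return []testCase{
		{name: "mutating case", mutate: func(s *podStatus) { s.restarts++ }, wantRestarts: 1},
		{name: "untouched case", wantRestarts: 0},
	}
}

// runCases creates the fixture inside the loop, so no reset hook is
// needed, and returns the names of the cases that failed.
func runCases(cases []testCase) []string {
	var failed []string
	for _, tc := range cases {
		status := newPodStatus() // fresh state for every case
		if tc.mutate != nil {
			tc.mutate(status)
		}
		if status.restarts != tc.wantRestarts {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", runCases(sampleCases()))
}
```

The trade-off is a little extra setup cost per case in exchange for dropping the reset callback from the test table.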
-	// Success is encoded as "true" (type Result)
-	Success Result = true
+	// Success is encoded as 0 (type Result)
+	Success Result = 0
The startup probe logic is almost the same as the liveness probe's, right?
Why is Success/Failure sufficient for the liveness probe, but not for the startup probe?
OK, because the initial values of the liveness probe and the startup probe are different.
Given that Unknown is newly introduced, and only used for initializing the startup probe, I think this should be safe.
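A small sketch may help show why a third value is useful as an initial state. It assumes, as this thread implies, that the liveness probe starts at Success and the startup probe starts at Unknown; the Failure default for readiness, the `probeType` enum, and the `initialValue` and `interpret` helpers are assumptions made for this example only. The `started` rule mirrors the `update.Result == results.Success` check quoted earlier in the review.

```go
package main

import "fmt"

// Result is the three-valued probe result discussed in this review.
type Result int

const (
	Unknown Result = -1
	Success Result = 0
	Failure Result = 1
)

// probeType is a hypothetical enum for this sketch.
type probeType int

const (
	liveness probeType = iota
	readiness
	startup
)

// initialValue returns the result assumed before a probe has ever run.
func initialValue(p probeType) Result {
	switch p {
	case liveness:
		return Success // a container is presumed alive until proven otherwise
	case startup:
		return Unknown // neither started nor failed yet
	default:
		return Failure // assumption for readiness in this sketch
	}
}

// interpret (hypothetical) turns a startup result into two decisions:
// whether the container counts as started and whether it should be restarted.
func interpret(r Result) (started, restart bool) {
	return r == Success, r == Failure
}

func main() {
	started, restart := interpret(initialValue(startup))
	// With Unknown as the initial value, neither decision fires early.
	fmt.Println(started, restart) // prints "false false"
}
```

With only two values, the startup probe would have to begin either as Success (marking the container started before any probe ran) or as Failure (triggering a restart before any probe ran).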
/lgtm with one small concern. Asked @yujuhong to take a quick look. If she is fine with it, I will approve it.
pkg/kubelet/prober/worker_test.go
Outdated
@@ -21,7 +21,7 @@ import (
 	"testing"
 	"time"

-	"k8s.io/api/core/v1"
+	v1 "k8s.io/api/core/v1"
remove "v1"
	Failure Result = 1

	// Unknown is encoded as -1 (type Result)
	Unknown Result = -1
nit: just use iota?
const (
	Unknown Result = iota - 1
	Success
	Failure
)
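For readers less familiar with `iota`, this short runnable sketch shows that the suggested declaration yields the same encodings as the explicit constants in the diff. The surrounding `package main` and the `Result` type declaration are added here only to make the snippet self-contained.

```go
package main

import "fmt"

// Result is declared locally so the snippet compiles on its own.
type Result int

const (
	Unknown Result = iota - 1 // iota is 0 on this line, so Unknown is -1
	Success                   // the expression repeats with iota == 1, giving 0
	Failure                   // iota == 2, giving 1
)

func main() {
	fmt.Println(int(Unknown), int(Success), int(Failure)) // prints "-1 0 1"
}
```

The implicit repetition of `iota - 1` keeps the values contiguous, at the cost of making each encoding slightly less obvious than the explicit form.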
89d4705 to 66595d5
/lgtm
@dchen1107 I think you can approve when you're back!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, matthyx The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks @dchen1107
What type of PR is this?
/kind bug
What this PR does / why we need it:
A startupProbe failure should (by design) trigger a restart of the container, as is the case for livenessProbes.
Which issue(s) this PR fixes:
Fixes #84178
Does this PR introduce a user-facing change?: